Final Project

Data Science for Public Policy

Author

Madeleine Adelson, Stephanie Jamilla, Jamie Jelly Murtha

About This Project

Urban environments tend to retain heat because they have less tree/vegetation cover, more impervious surfaces such as asphalt, and other differentiating factors compared to those experienced in non-urban environments. Using data on DC’s landcover, (Urban Tree Canopy by Census Block Group in 2020) we built and tested three different models to predict where hotspots are most likely to occur within the District. We also included several variables from the American Community Survey (ACS) to test whether demographic characteristics could also be predictive of hotspots. Our analysis is at the block group level.

The main outcome variable is the average evening temperature, which is part of the DC landcover dataset. This was created by the Sustaining Urban Places Research Lab (SUPR) at Portland State University in 2018. While we attempted to create our own temperature data leveraging remote sensing raster data from satellites like MODIS, the publicly available data was at too large of a resolution to capture different temperatures in each block group the way that SUPR was able to accomplish. This data limitation, combined with how the DC government–our intended end-user–has employed this specific SUPR temperature variable within its own 2020 dataset, has led us to leverage this variable as our key outcome variable. Notably, DC has chosen to use average evening temperatures (instead of morning or afternoon temperatures) in their landcover dataset. We believe this is because surface and air temperatures tend to be vastly different during the daytime but more similar at night–thus potentially accounting for a more consistent temperature experience. Additionally, built urban areas that have a high concentration of impervious surfaces like roads and concrete cool much more slowly at night compared to non-built areas, heightening the temperature disparity and emphasizing the urban heat island effect of cities.

Setup

Loading packages:

library(dotenv)
library(tidyverse)
library(ggplot2)
library(sf)
library(tidycensus)
library(stringr)
library(magrittr)
library(jsonlite)
library(httr)
library(tidymodels)
library(knitr)
library(kableExtra)
library(data.table)
library(patchwork)
library(vip)
library(Metrics)

# Clone the private github repository for this assignment using SSH link:
# git@github.com:stejamilla/final-project.git

# do not use scientific notation
options(scipen = 999)

Prepare the Data (Loading and Cleaning)

DC Land Cover Data

Load data from Open Data DC (landcover, ward geometry):

# Load Urban Tree Canopy by Census Block Group (2020) data
landcover <- read_sf(paste0("data/Urban_Tree_Canopy",
                            "_by_Census_Block_Group_in_2020.shp")) |>
  rename(block_group = GEOID) |>
  mutate(block_group = as.character(block_group)) |>
  mutate(OBJECTID = as.character(OBJECTID)) |>
  rename_all(tolower) |>
  relocate(block_group, .after = objectid) |>
  # Converting UHI to Fahrenheit
  mutate(uhi = (uhi*(9/5) + 32))

# Load DC Ward data from Urban Tree Canopy by Ward (2020)
wards <- read_sf("data/Urban_Tree_Canopy_by_Ward_in_2020.shp") |>
  rename_all(tolower) |>
  select(ward) |>
  st_transform(crs = st_crs(landcover))

Categorize each block group by ward:

# Spatial join so that each block group is assigned a ward based on majority
# area of a block group within a ward
landcover <- st_join(landcover, wards, largest = TRUE)

# Map the overlapping wards & block groups
landcover |>
  mutate(ward = as.character(ward)) |>
  ggplot() +
  geom_sf(aes(fill = ward)) +
  theme_void() +
  labs(title = "District of Columbia Census Block Groups, by Ward",
       subtitle = "2020",
       caption = paste0("Source: Census American Community Survey, 2016 - 2020",
                        "\nDC Open Data Urban Tree Canopy by Census Block",
                        " Group, 2020"),
       fill = "Ward") 

If a ward intersected a block group, we classified that block group as a ward. If a block group overlapped multiple wards, we selected the ward that contained the largest portion of the block group’s area to serve as that block group’s ward for the study.

Census Demographic Data

Load ACS 5-yr 2016-2020 variables:

acs_variables <- load_variables(2020, "acs5") |>
  filter(geography == "block group")

Load select Census ACS 5-yr 2016 - 2020 variables via API

# Secure Census API key (from .env file) for use in Census API call
load_dot_env()
credential <- Sys.getenv("census_api_key")

# Create url for API call
url <- str_glue(paste0("https://api.census.gov/data/2020/acs/acs5?get=",
             # Description of data point
             "NAME,",
             # TOTAL POPULATION ESTIMATE
             "B01003_001E,",
             # TOTAL POPULATION BY RACE - WHITE ALONE
             "B02001_002E,",
             # TOTAL POPULATION BY RACE - BLACK OR AFR AMER ALONE
             "B02001_003E,",
             # TOTAL POPULATION BY RACE - AMER IND AND AK NAT ALONE
             "B02001_004E,",
             # TOTAL POPULATION BY RACE - ASIAN ALONE
             "B02001_005E,",
             # TOTAL POPULATION BY RACE - NAT HAW OR PAC ISL ALONE
             "B02001_006E,",
             # TOTAL POPULATION BY RACE - OTHER ALONE + TWO OR MORE RACES
             "B02001_008E,",
             # TOTAL HISPANIC OR LATINO
             "B03003_003E,",
             # TOTAL POP WITH DOCTORATE DEGREE
             "B15003_025E,",
             # TOTAL POP WITH PROFESSIONAL SCHOOL DEGREE
             "B15003_024E,",
             # TOTAL POP WITH MASTER'S DEGREE
             "B15003_023E,",
             # TOTAL POP WITH BACHELORS DEGREE
             "B15003_022E,",
             # TOTAL POP WITH ASSOCIATES DEGREE
             "B15003_021E,",
             # TOTAL POP WITH REGULAR HIGH SCHOOL DIPLOMA
             "B15003_017E,",
             # TOTAL POP WITH GED OR ALTERNATIVE CREDENTIAL
             "B15003_018E,",
             # MEDIAN HH INCOME IN PAST 12 MOS
             "B19013_001E,",
             # TOTAL HOUSING UNITS
             "B25002_001E,",
             # TOTAL OCCUPIED HOUSING UNITS
             "B25002_002E,",
             # TOTAL VACANT HOUSING UNITS
             "B25002_003E,",
             # MEDIAN PROPERTY VALUE
             "B25077_001E,",
             # GROSS RENT AS A % OF HH INCOME IN PAST 12 MO -
             # TOTAL
             "B25070_001E,",
             # GROSS RENT AS A % OF HH INCOME IN PAST 12 MO -
             # 30 TO 34.9 %
             "B25070_007E,",
             # GROSS RENT AS A % OF HH INCOME IN PAST 12 MO -
             # 35 TO 39.9 %
             "B25070_008E,",
             # GROSS RENT AS A % OF HH INCOME IN PAST 12 MO -
             # 40 TO 49.9 %
             "B25070_009E,",
             # GROSS RENT AS A % OF HH INCOME IN PAST 12 MO -
             # 50 % or more
             "B25070_010E",
             # Limit to block-group-level data for DC
             "&for=block%20group:*&for=tract:*&in=state:11&in=county:001"))

# Use url to request data from API
acs_json <- GET(url = url)

# Check call status
http_status(acs_json)
$category
[1] "Success"

$reason
[1] "OK"

$message
[1] "Success: (200) OK"
# Save API JSON response as text
acs_json <- content(acs_json, as = "text")

# Save JSON as character matrix
acs_matrix <- fromJSON(acs_json)

# Convert matrix to tibble
acs_data <- as_tibble(acs_matrix[2:nrow(acs_matrix), ],
                      .name_repair = "minimal")

# Add variable names to tibble
names(acs_data) <- acs_matrix[1, ] |>
  str_replace(" ", "_")

Label and clean select ACS variables for model, or to be used in new measure variables for model:

# Clean up ACS data
acs_data <- acs_data |>
  # Meaningful variable names
  rename(total_pop = B01003_001E,
         race_white = B02001_002E,
         race_black = B02001_003E,
         race_ai_an = B02001_004E,
         race_asian = B02001_005E,
         race_nh_pi = B02001_006E,
         race_oth_mult = B02001_008E,
         race_eth_hisp = B03003_003E,
         ed_doct = B15003_025E,
         ed_prof_deg = B15003_024E,
         ed_master = B15003_023E,
         ed_bach = B15003_022E,
         ed_assoc = B15003_021E,
         ed_ged = B15003_018E,
         ed_reg_hsd = B15003_017E,
         med_hhi = B19013_001E,
         med_prop_val = B25077_001E,
         rent_burden_tot = B25070_001E,
         rent_burden_30_34 = B25070_007E,
         rent_burden_35_39 = B25070_008E,
         rent_burden_40_49 = B25070_009E,
         rent_burden_50plus = B25070_010E,
         total_hous_units = B25002_001E,
         vac_units = B25002_003E,
         occ_units = B25002_002E,
         block_group_temp = block_group) |>
  # Create block_group column to use for combining this with the other datasets
  mutate(block_group = paste(state,
                             county, 
                             tract, 
                             block_group_temp,
                             sep = "")) |>
  # Convert Census variables to numeric
  mutate_at(vars(total_pop:rent_burden_50plus), as.numeric) |>
  rename_all(tolower) |>
  relocate(block_group, .before = name)


# Convert rent burden variables into a percentage of rent burdened households,
# then drop original rent burden variables
acs_data <- acs_data |> 
  mutate(
    rent_burden_sum = 
      rent_burden_30_34 + 
       rent_burden_35_39 + 
       rent_burden_40_49 + 
       rent_burden_50plus
    ) |>
  mutate(
    rent_burden_pct = (rent_burden_sum / rent_burden_tot)
    ) |>
  select(-c(rent_burden_30_34, 
            rent_burden_35_39, 
            rent_burden_40_49, 
            rent_burden_50plus,
            rent_burden_tot,
            rent_burden_sum)
         )

When selecting which ACS variables to use, we wanted to include various characteristics of block groups. We chose to include total population and racial disaggregation of population to understand how the volume of population could affect temperature and if demographics of that population could be an important predictor. Levels of education are another way to understand the population in the block group, especially if people of similar levels of educational attainment live close to each other. Median household income and rent burden are economic indicators and consider how economic conditions could affect the conditions of a block group. Lastly, number of housing units, vacant units, occupied units, and median property value are all economic characteristics of the block group (rather than their residents), that may contribute to the physical environment of that block group.

Data Imputation Part 1

A few of our Census variables (rent burden, median household income, median property value) are missing data for some block groups. This seems to be due to insufficient response rates in those block groups: the standard error of the estimate was larger in magnitude than the estimate itself, so the Census Bureau suppresses that estimate.

Nearby block groups are likely to share similar characteristics in terms of property values, median income, and rent burdens, since those tend to reflect neighborhood characteristics that exist at a larger geography than the block group. Therefore, we came up with the following imputation plan:

  1. If data are missing at the block group level, the NA block group is assigned the mean value of the variable from neighboring block groups who share a census tract.

  2. If data are still missing at the census tract level, use the mean for neighboring tracts in the corresponding ward. (This happens in the “Data Imputation Part 2” section below.)

We recognize that this method is imperfect. First, ward-level imputation is less precise since wards are larger geographies. Second, the method may give too much “weight” to block groups that share a Census tract with an NA block group. For example, if within a ward we had to impute data for a couple block groups based on tract-level averages and then imputed data for other block groups based on the ward-level average, then that means the ward-level average partially was influenced by other already-imputed values (using tract-level averages).

However, there were only a handful of instances of ward-level imputation for the first two variables:

  • Rent burden: 7 block groups are imputed at the ward level

  • Median HHI: 5 block groups ” ”

And a fairly small number for property value (less than 5% of observations):

  • Median property value: 27 block groups imputed at the ward level

Another limitation is that median househould income and median property value variables are top- and bottom-coded, meaning that imputed values (based on averages) may be lower or higher than they should be due to the top/bottom coding.

Imputation Part 1: Replace missing data points with tract-level data where available.

# Create tract dataframe
tract_data <- acs_data |>
  mutate(med_hhi = if_else(
    med_hhi < 0, NA, med_hhi),
    med_prop_val = if_else(
      med_prop_val < 0, NA, med_prop_val)
  ) |>
  group_by(tract) |>
  summarize(
    rb_tract = mean(rent_burden_pct, na.rm = TRUE),
    hhi_tract = mean(med_hhi, na.rm = TRUE),
    propval_tract = mean(med_prop_val, na.rm = TRUE)
  )

# Replace missing data with tract data:
acs_data <- acs_data |>
  left_join(tract_data, by = "tract") |>
  # Rent burden
  mutate(
    # Create indicator for imputed vars
    rb_imputed = if_else(
      is.na(rent_burden_pct), 1, 0),
    # Tract-level imputation
    rent_burden_pct = if_else(
      is.na(rent_burden_pct), rb_tract, rent_burden_pct)
  ) |>
  # Median HHI
  mutate(
    # Create indicator for imputed vars
    hhi_imputed = if_else(
      med_hhi < 0, 1, 0),
    # Tract-level imputation
    med_hhi = if_else(
    med_hhi < 0, hhi_tract, med_hhi)
  ) |>
  # Median property value
  mutate(
    # Create indicator for imputed vars
    propval_imputed = if_else(
      med_prop_val < 0, 1, 0),
    # Tract-level imputation
    med_prop_val = if_else(
    med_prop_val < 0, propval_tract, med_prop_val)
  )

Create the Final Dataset for Analysis

Determine Threshold for Hotspot vs. Not Hotspot

Decide the threshold for a hotspot:

# Visualize a cumulative histogram of counts of UHI values
landcover |>
  group_by(uhi) |>
  ggplot(aes(uhi)) +
  geom_histogram(aes(y = cumsum(..count..))) +
  labs(title = paste0("Distribution of Urban Heat Index (UHI) Variable",
                      " District of Columbia"),
       subtitle = "2020",
       caption = paste0("\nSource: DC Open Data Urban Tree Canopy by Census",
                        "\nBlock Group, 2020, and Portland State University")) +
  xlab("Urban Heat Index (UHI)") +
  ylab("Frequency")

# Calculate / visualize frequency & cumulative frequency
heat_freq <- landcover |>
  count(uhi)

heat_freq$cumulative <- cumsum(heat_freq$n)

# Calculate the standard deviation and the mean of uhi
heat_freq |> summarize(
  sd(uhi),
  mean(uhi)
)
Simple feature collection with 1 feature and 2 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 389616.6 ymin: 124877.9 xmax: 407860.4 ymax: 147545.8
Projected CRS: NAD83 / Maryland
# A tibble: 1 × 3
  `sd(uhi)` `mean(uhi)`                                                 geometry
      <dbl>       <dbl>                                            <POLYGON [m]>
1      1.80        89.8 ((394387 136007.6, 394387.7 136010.7, 394388 136012.3, …
# Calculate how many block groups have one standard deviation above the mean
heat_freq |>
  filter(uhi >= 91.6) |>
  count()
Simple feature collection with 1 feature and 1 field
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 395375 ymin: 134605.6 xmax: 401893.4 ymax: 143306.4
Projected CRS: NAD83 / Maryland
# A tibble: 1 × 2
      n                                                                 geometry
* <int>                                                       <MULTIPOLYGON [m]>
1    82 (((395651.3 136857, 395651.4 136859.2, 395657.1 136859.3, 395664.3 1368…

The threshold for “hotspot” is having a uhi (average evening temperature) of 91.6 degrees Fahrenheit, which is one standard deviation (1.8 degrees) above the mean. This means that 82 of the 571 block groups (~14%) would be classified as a hotspot. We decided on this threshold with the belief that a hotspot should represent a block group that meaningfully has an average higher temperature than the majority of the sample. Using standard deviation as a benchmark seemed to be an appropriate way to naturally divide the data.

dc_full_data <- left_join(landcover, acs_data, by = "block_group") 

Data Imputation Part 2

Imputation part 2: for any data points still missing after imputing at tract level, we impute at the ward level.

# Create ward-level dataframe
ward_data <- dc_full_data |>
  group_by(ward) |>
  summarize(
    rb_ward = mean(rent_burden_pct, na.rm = TRUE),
    hhi_ward = mean(med_hhi, na.rm = TRUE),
    propval_ward = mean(med_prop_val, na.rm = TRUE)
  ) |>
  st_drop_geometry()

# Replace missing data with ward data:
# Rent burden
dc_full_data <- dc_full_data |>
  left_join(ward_data, by = "ward") |>
  mutate(
    rent_burden_pct = if_else(
      is.na(rb_tract), rb_ward, rent_burden_pct),
    med_hhi = if_else(
      is.na(hhi_tract), hhi_ward, med_hhi),
    med_prop_val = if_else(
      is.na(propval_tract), propval_ward, med_prop_val)
    )
  
# Drop columns no longer needed
dc_full_data <- dc_full_data |>
  select(-c(rb_tract, rb_ward,
            hhi_tract, hhi_ward,
            propval_tract, propval_ward))

Finalize the Full Dataset

We create a variety of metrics using mutations of our imported data.

dc_full_data <- dc_full_data|>
  mutate(
    # convert total acres to sq miles
    total_sq_mi = total_ac * 0.0015625,
    # convert water acres to sq miles
    water_sq_mi = wat_ac * 0.0015625,
    # calculate water share of total sq miles
    share_water_sq_mi = water_sq_mi / total_sq_mi,
    # calculate population density per square mile
    pop_dens_sq_mi = total_pop / total_sq_mi,
    # calculate population density per water sq mile
    pop_density_water_sq_mi = ifelse(
      water_sq_mi !=0, total_pop / water_sq_mi, 0),
    # calculate ratio of housing units to population
    hous_units_per_person = ifelse(
      total_hous_units != 0, total_hous_units / total_pop, 0),
    # calculate ratio of vacant units to all units
    vac_units_share = ifelse(
      total_hous_units != 0, vac_units / total_hous_units, 0),
    # calculate ratio of total income to median home value
    inc_home_val_ratio = med_hhi / med_prop_val,
    # create combined HS diploma or equivalent variable
    ed_comb_hsd = ed_reg_hsd + ed_ged,
    # create combined advanced post-undergraduate variable
    ed_comb_adv_deg = ed_doct + ed_prof_deg + ed_master,
    # calculate ratio of 25+ pop with HSD to total pop
    ed_hsd_ratio = ifelse(
      ed_comb_hsd !=0, ed_comb_hsd / total_pop, 0),
    # calculate ratio of 25+ pop with bachelor's degree to total pop
    ed_bach_ratio = ifelse(
      ed_bach != 0, ed_bach / total_pop, 0),
    # calculate ratio of 25+ pop with advanced post-bach degree to total pop
    ed_adv_ratio = ifelse(
      ed_comb_adv_deg != 0, ed_comb_adv_deg / total_pop, 0),
    # calculate pop share white
    race_white_ratio = ifelse(
      race_white != 0, race_white / total_pop, 0),
    # calculate pop share black or african american
    race_black_ratio = ifelse(
      race_black != 0, race_black / total_pop, 0),
    # calculate pop share american indian or alaska native
    race_ai_an_ratio = ifelse(
      race_ai_an!= 0, race_ai_an / total_pop, 0),
    # calculate pop share asian
    race_asian_ratio = ifelse(
      race_asian != 0, race_asian / total_pop, 0),
    # calculate pop share native hawaiian or pacific islander
    race_nh_pi_ratio = ifelse(
      race_nh_pi != 0, race_nh_pi / total_pop, 0),
    # calculate pop share two or more races
    race_oth_mult_ratio = ifelse(
      race_oth_mult != 0, race_oth_mult / total_pop, 0),
    # calculate pop share hispanic or latino
    race_eth_hisp_ratio = ifelse(
      race_eth_hisp != 0, race_eth_hisp / total_pop, 0),
    # calculate total non-white population
    race_poc = race_black + race_ai_an + race_asian + race_nh_pi + race_oth_mult,
    # calculate pop share non-white
    race_poc_ratio = ifelse(
      race_poc != 0, race_poc / total_pop, 0),
    # creating hotspot var for decision tree model
    hotspot = if_else(
      uhi >= 91.6, "hotspot", "not hotspot"),
    # remove factor from uhi var
    uhi = as.numeric(as.character(uhi))) |>
  relocate(c("block_group_temp", "tract"), .after = 2) |>
  # remove columns with only one unique value
  select(-objectid,
         -name,
         -state,
         -statefp,
         -county,
         -countyfp,
         -tractce,
         -blkgrpce,
         -namelsad,
         -mtfcc,
         -funcstat,
         -field,
         -shapearea,
         -shapelen) |>
  relocate(race_white_ratio:race_eth_hisp_ratio,
           .after = race_eth_hisp) |>
  relocate(ed_comb_hsd:ed_adv_ratio,
           .after = ed_ged) |>
  relocate(hous_units_per_person:inc_home_val_ratio,
           .after = med_prop_val) |>
  relocate(total_sq_mi:pop_density_water_sq_mi,
           .after = inc_home_val_ratio) |>
  relocate(total_pop:hotspot, .after = tract)

Evaluating Imputation Via Visualization

Missing data have been imputed using the next smallest geography available. In order to assess the accuracy of these imputations, we plot the variables with imputed values against another variable with no imputed values, one which should have some correlation or general relation with the partially imputed variable. We can then see if the imputed values follow the general trend of the correlation or not.

Rent Burden imputation evaluation

Compared to the other partially-imputed variables, it was harder to find variables that were clearly correlated with rent burden (the percentage of households in a given tract spending 30% or more of their income on rent). However, to the extent that trends could be seen in visualizations of the data, our imputed values seem to follow those trends.

Rent burden vs. percent of population with a bachelor’s degree:

dc_full_data |> ggplot() +
  geom_point(mapping =
    aes(x = ed_bach_ratio, y = rent_burden_pct,
        color = as.character(rb_imputed)),
    alpha = 0.4
  ) +
  scale_color_manual(labels = c("No", "Yes"), values = c("#2271B5", "#8E133B")) +
  labs(title = "Evaluating Rent Burden Imputation",
       subtitle = "Comparison with % population with a Bachelor's degree",
       color = "Imputed?") +
  xlab("Proportion of Population with a Bachelor's Degree") +
  ylab("Proportion of Households with Rent Burden") +
  theme_minimal()

There is no strong trend here, but no imputed data points seem to fall outside the general shape of the data.

Rent burden vs. ward:

dc_full_data |> ggplot() +
  geom_point(mapping =
    aes(x = ward, y = rent_burden_pct,
        color = as.character(rb_imputed)),
    alpha = 0.4
  ) +
  scale_color_manual(labels = c("No", "Yes"), values = c("#2271B5", "#8E133B")) +
  labs(title = "Evaluating Rent Burden Imputation",
       subtitle = "Comparison with ward",
       color = "Imputed?") +
  xlab("Ward") +
  ylab("Proportion of Households with Rent Burden") +
  theme_minimal()

No imputed values fall outside the range for their ward. (Since a few imputed values were imputed at the ward level, we would not want to rely on this alone - but it provides additional reassurance alongside other comparisons.)

Median household income imputation evaluation

Median household income vs. people of color ratio:

dc_full_data |> ggplot() +
  geom_point(mapping =
    aes(x = race_poc_ratio, y = med_hhi,
        color = as.character(hhi_imputed)),
    alpha = 0.4
  ) +
  scale_color_manual(labels = c("No", "Yes"), values = c("#2271B5", "#8E133B")) +
  labs(title = "Evaluating Median Household Income Imputation",
       subtitle = "Comparison with POC % of population",
       color = "Imputed?") +
  xlab("Proportion of Population POC") +
  ylab("Median Household Income") +
  theme_minimal()

There is a clear negative correlation between median household income and POC ratio. The imputed data points seem to follow this trend. There is one datapoint where race_poc_ratio = 0 and imputed med_hhi looks to be somewhat too low, but since it’s only one data point, we are not too worried about systematic inaccuracy of the imputed values.

Median household income vs. percent of population with a bachelor’s degree:

dc_full_data |> ggplot() +
  geom_point(mapping =
    aes(x = ed_bach_ratio, y = med_hhi,
        color = as.character(hhi_imputed)),
    alpha = 0.4
  ) +
  scale_color_manual(labels = c("No", "Yes"), values = c("#2271B5", "#8E133B")) +
  labs(title = "Evaluating Median Household Income Imputation",
       subtitle = "Comparison with % of population with a bachelor's degree",
       color = "Imputed?") +
  xlab("Proportion of Population with a Bachelor's Degree") +
  ylab("Median Household Income") +
  theme_minimal()

There is less of a strong correlation here, but again, the imputed data points seem to fit the overall trend. There is an outlier in terms of ed_bach_ratio (= 1.00); arguably this should have had a larger value for med_hhi, but again, it is just one data point.

Median property value imputation evaluation

Median property value vs. POC ratio:

dc_full_data |> ggplot() +
  geom_point(mapping =
    aes(x = race_poc_ratio, y = med_prop_val,
        color = as.character(propval_imputed)),
    alpha = 0.4
  ) +
  scale_color_manual(labels = c("No", "Yes"), values = c("#2271B5", "#8E133B")) +
  labs(title = "Evaluating Median Property Value Imputation",
       subtitle = "Comparison with POC % of population",
       color = "Imputed?") +
  xlab("Proportion of Population POC") +
  ylab("Median Property Value") +
  theme_minimal()

There is a clear negative relationship between median property value and race_poc_ratio. The imputed data points seem to follow this trend.

Median property value vs. percent of population with a bachelor’s degree:

# Median property value vs. bachelor's degree ratio
dc_full_data |> ggplot() +
  geom_point(mapping =
    aes(x = ed_bach_ratio, y = med_prop_val,
        color = as.character(propval_imputed)),
    alpha = 0.4
  ) +
  scale_color_manual(labels = c("No", "Yes"), values = c("#2271B5", "#8E133B")) +
  labs(title = "Evaluating Median Property Value Imputation",
       subtitle = "Comparison with % of population with a bachelor's degree",
       color = "Imputed?") +
  xlab("Proportion of Population with a Bachelor's Degree") +
  ylab("Median Property Value") +
  theme_minimal()

The trend here is not as clear, but there seems to be a positive relationship between med_prop_val and ed_bach_ratio, and the imputed data points follow this trend. Again, we have one potentially suspicious data point for which ed_bach_ratio = 1.00 and imputed med_prop_val should perhaps be higher, but it’s a single data point.

After performing these analysis, we remove the imputation indicator variables:

dc_full_data <- dc_full_data |>
  select(-rb_imputed,
         -hhi_imputed,
         -propval_imputed)

Prepare the Data for Analysis (Seed Setting & Splitting)

# Set seed
set.seed(20241202)

# Split sample and create training and testing datasets
dc_split <- initial_split(data = dc_full_data, strata = ward, prop = 0.8)
dc_train <- training(x = dc_split) 
dc_test <- testing(x = dc_split)

Conduct Exploratory Data Analysis

Review Distribution of Median Household Income

# Median household income, continuous
dc_train |>
  ggplot() +
  geom_histogram(aes(x = as.numeric(med_hhi))
  ) +
  theme_minimal() +
  labs(title = paste0("Median Household Income Distribution in the District of",
                      " Columbia"),
       subtitle = "2020",
       caption = "\nSource: Census American Community Survey, 2016 - 2020") +
  xlab("Median Household Income") +
  ylab("Frequency") +
  scale_x_continuous(labels = scales::label_dollar())

A check of the distribution of the median household income variable does not reveal any red flags about the validity of the data.

Explore Temperature in Relation to Other Variables

dc_full_data |> ggplot() +
  geom_sf(
    mapping = aes(fill = uhi),
    color = "#400001"
    ) +
    scale_fill_gradient(
    low = "#FCF4B3",
    high = "#940000"
    ) +
  labs(
    title = "Average Evening Temperature (2018)",
    fill = "Fahrenheit",
    caption = paste0("Source: DC Open Data Urban Tree Canopy by Census",
                        "\nBlock Group, 2020, and Portland State University")) +
  theme_void()

This map generally demonstrates the average evening temperature of block groups. Notably, Rock Creek Park is a particular area that is relatively cooler compared to the rest of the city. While it was difficult to find information on when exactly SUPR logged temperatures, we found the Tacoma, WA’s version of this dataset on the city’s open data portal, whose description claims that temperatures were from July 2018. Thus, we assume that DC temperatures were also taken that summer.

Temperature & population:

# Total population
total_pop_eda <- dc_train |>
  ggplot(aes(x = total_pop, y = uhi, color = hotspot)) +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  labs(title = "UHI and Total Population",
       subtitle = "By Block Group",
       caption = paste0("Source: Census American\nCommunity Survey, 2016-2020;",
                        "\nDC Open Data Urban Tree\nCanopy by Census Block",
                        "\nGroup, 2020, and\nPortland State University"),
       color = "Hotspot") +
  xlab("\nTotal Population") +
  ylab("UHI") +
  scale_x_continuous(labels = scales::label_comma())

# Population density
pop_density_eda <- dc_train |>
  ggplot(aes(x = pop_dens_sq_mi, y = uhi, color = hotspot)) +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  labs(title = "UHI and Population Density",
       subtitle = "By Block Group",
       caption = paste0("Source: Census American\nCommunity Survey, 2016-2020;",
                        "\nDC Open Data Urban Tree\nCanopy by Census Block",
                        "\nGroup, 2020, and\nPortland State University"),
       color = "Hotspot") +
  xlab("\nPopulation per Square Mile") +
  ylab("UHI") +
  scale_x_continuous(labels = scales::label_comma())


# Non-white population density
poc_pop_eda <- dc_train |>
  ggplot(aes(x = race_poc_ratio, y = uhi, color = hotspot)) +
  geom_point(alpha = 0.5) +
  theme_minimal() +
    labs(title = "UHI and Non-White\nPopulation Density",
       subtitle = "By Block Group",
       caption = paste0("Source: Census American\nCommunity Survey, 2016-2020;",
                        "\nDC Open Data Urban Tree\nCanopy by Census Block",
                        "\nGroup, 2020, and\nPortland State University"),
       color = "Hotspot") +
  xlab("\nNon-White Population Proportion") +
  ylab("UHI")

total_pop_eda + pop_density_eda + poc_pop_eda 

Population variables do not indicate a strong relationship with UHI/temperature.

Temperature & income/education:

# Median household income
mhi_eda <- dc_train |>
  ggplot(aes(x = med_hhi, y = uhi, color = hotspot)) +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  labs(title = "UHI and Median Household Income",
       subtitle = "By Block Group",
       caption = paste0("Source: Census American\nCommunity Survey, 2016-2020;",
                        "\nDC Open Data Urban Tree\nCanopy by Census Block",
                        "\nGroup, 2020, and\nPortland State University"),
       color = "Hotspot") +
  xlab("Household Income") +
  ylab("UHI") +
  scale_x_continuous(labels = scales::label_comma())

# Ratio of bachelors degrees
bach_eda <- dc_train |>
  ggplot(aes(x = ed_bach_ratio, y = uhi, color = hotspot)) +
  geom_point(alpha = 0.5) +
  theme_minimal() +
    labs(title = "UHI and College Degree Attainment",
       subtitle = "By Block Group",
       caption = paste0("Source: Census American\nCommunity Survey, 2016-2020;",
                        "\nDC Open Data Urban Tree\nCanopy by Census Block",
                        "\nGroup, 2020, and\nPortland State University"),
       color = "Hotspot") +
  xlab("Proportion of population with a bachelor's degree") +
  ylab("UHI") +
  scale_x_continuous(labels = scales::label_comma())

mhi_eda + bach_eda

Similarly, variables indicating economic status and educational attainment do not show strong relationship with UHI.

Temperature & landcover:

# Percentage area of urban tree canopy
utc_eda <- dc_train |>
  ggplot(aes(x = utc_pct, y = uhi, color = hotspot)) +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  labs(title = "UHI and Urban Tree Canopy",
       subtitle = "By Block Group",
       caption = paste0("Source: Census American\nCommunity Survey, 2016-2020;",
                        "\nDC Open Data Urban Tree\nCanopy by Census Block",
                        "\nGroup, 2020, and\nPortland State University"),
       color = "Hotspot") +
  xlab("% Land Area Covered By Urban Tree Canopy") +
  ylab("UHI")
  
# Percentage area of impervious surfaces
imp_eda <- dc_train |>
  ggplot(aes(x = to_ia_pct, y = uhi, color = hotspot)) +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  labs(title = "UHI and Impervious Surfaces",
       subtitle = "By Block Group",
       caption = paste0("Source: Census American\nCommunity Survey, 2016-2020;",
                        "\nDC Open Data Urban Tree\nCanopy by Census Block",
                        "\nGroup, 2020, and\nPortland State University"),
       color = "Hotspot") +
  xlab("% Land Area Covered By Impervious Surfaces") +
  ylab("UHI")

utc_eda + imp_eda

Meanwhile, landcover variables show a potentially strong relationship with UHI. Generally, it seems that as urban tree canopy area increases, UHI decreases. Conversely, as the percent area of impervious surfaces (e.g., roads, parking lots, etc.) increases, UHI also increases. This may indicate that landcover variables may hold more importance in our models compared to demographic ones.

Preparation for Modeling

# Prep the dataset for modeling
dc_train <- dc_train |>
  st_drop_geometry() |>
  select(
    # remove identifying vars
    -block_group,
    -intptlat,
    -intptlon,
    -gis_id,
    -globalid, gid,
    -tract,
    -ward,
    -block_group_temp,
    # remove raw total pop var (custom pop density and other pop proportion
    # vars remain)
    -total_pop,
    # remove raw race vars (custom proportion vars remain)
    -race_white,
    -race_black,
    -race_ai_an,
    -race_asian,
    -race_nh_pi,
    -race_oth_mult,
    -race_eth_hisp,
    # remove raw ed vars (custom proportion vars remain)
    -ed_doct,
    -ed_prof_deg,
    -ed_master,
    -ed_bach,
    -ed_assoc,
    -ed_reg_hsd,
    -ed_ged,
    -ed_comb_hsd,
    -ed_comb_adv_deg,
    # remove duplicative housing data (custom proportion vars and pct var
    # remain)
    -total_hous_units,
    -occ_units,
    -vac_units,
    # remove duplicative land data (custom proportion / "pct" vars remain)
    -ends_with("_ac"),
    -total_sq_mi,
    -water_sq_mi,
    -wtr_pct,
    -aland,
    -awater,
    -utcac0620,
    -utcac1120,
    -utcac1520,
    -utcpct0620,
    -utcpct1120,
    -utc_pct_06,
    -utc_pct_11,
    -utc_pct_15,
    # remove pop_density_water_sq_mi due to INF value for most records 
    -pop_density_water_sq_mi) 

# V-fold cross validation with 10 folds
dc_folds <- vfold_cv(data = dc_train, v = 10)

Decision Tree

Modeling

dt_recipe <- recipe(formula = hotspot ~ ., data = dc_train) |>
  # Remove uhi var to avoid duplication of data from hotspot
  step_rm(uhi) |>
  themis::step_downsample(hotspot) |> 
  step_nzv(all_numeric_predictors())

dt_model <- decision_tree() |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "classification")

dt_workflow <- workflow() |>
  add_recipe(dt_recipe) |>
  add_model(dt_model)
  
dt_fitting <- dt_workflow |>
  fit(data = dc_train)

rpart.plot::rpart.plot(x = dt_fitting$fit$fit$fit,
                       roundint = FALSE)

The first decision point is whether the percentage of the block group that’s covered by tree canopy; if it’s greater or equal to 23% then it’s not a hotspot. If it is less than 23%, then you check how suitable the block group is for tree planting “based on a weighted formula that includes all planting prioritization categories.” If the score is less than 7, then it’s a not hotspot.

If it’s greater than or equal to seven, you check the percent of total area that’s covered by all combined impervious surfaces. If the percentage is less than 66%, then it’s not a hotspot. If it’s greater than or equal to 66%, then the final check is whether the number of housing units per person is less than 0.74. If it’s greater than or equal to 0.75, then it’s not a hotspot. If it is less than 0.74 then it is a hotspot.

Interestingly, the decision tree focuses on landcover variables. This is consistent with the EDA conducted that suggest that the temperature/uhi data have a notable relationship with landcover variables but not demographic variables. The last decision was a surprise because it initially seems counter-intuitive that more housing per person would lead to a block group not being a hotspot. Perhaps this is because more housing units per person mean that units are less densely occupied and therefore could contribute to cooler conditions.

Visualize Variables in the Decision Tree

canopy_map <- dc_full_data |> ggplot() +
  geom_sf(aes(fill = utc_pct)) +
    scale_fill_gradient(
    low = "#fcbc32",
    high = "#4e974b"
    ) +
  theme_void() +
  labs(
    title = "Percentage of Tree Canopy",
    subtitle = "By Block Group",
    fill = "Percent Area",
    caption = paste0("\nSource: DC Open Data\nUrban Tree Canopy by\nCensus Block",
                        " Group, 2020,\nand Portland State University\n\n"))

planting_map <- dc_full_data |> ggplot() +
  geom_sf(aes(fill = overall)) +
    scale_fill_gradient(
    low = "#e2e2e2",
    high = "#16501a"
    ) +
  theme_void() +
  labs(
    title = "Tree Planting Suitability Score",
    subtitle = "By Block Group",
    fill = "Score",
    caption = paste0("\nSource: DC Open Data\nUrban Tree Canopy by\nCensus Block",
                        " Group, 2020,\nand Portland State University\n\n"))

impervious_map <- dc_full_data |> ggplot() +
  geom_sf(aes(fill = to_ia_pct)) +
    scale_fill_gradient(
    low = "#ffffff",
    high = "#28292b"
    ) +
  theme_void() +
  labs(
    title = "Percentage Covered By\nImpervious Surfaces",
    subtitle = "By Block Group",
    fill = "Percent Area",
    caption = paste0("\nSource: DC Open Data\nUrban Tree Canopy by\nCensus Block",
                        " Group, 2020,\nand Portland State University"))

housing_map <- dc_full_data |> ggplot() +
  geom_sf(aes(fill = hous_units_per_person)) +
    scale_fill_gradient(
    low = "#d86400",
    high = "#0085d4"
    ) +
  theme_void() +
  labs(
    title = "Housing Units Per Person",
    subtitle = "By Block Group",
    fill = "# Housing Units",
    caption = paste0("Source: Census American\nCommunity Survey, 2016-2020"))

canopy_map + planting_map + impervious_map + housing_map

Evaluate the Model

Predict on the Testing Data

# Ensuring the testing dataset has the same variables as the training dataset
dc_test <- dc_test |>
  st_drop_geometry() |>
  select(
    # remove identifying vars
    -block_group,
    -intptlat,
    -intptlon,
    -gis_id,
    -globalid, gid,
    -tract,
    -ward,
    -block_group_temp,
    # remove raw total pop var (custom pop density and other pop proportion
    # vars remain)
    -total_pop,
    # remove raw race vars (custom proportion vars remain)
    -race_white,
    -race_black,
    -race_ai_an,
    -race_asian,
    -race_nh_pi,
    -race_oth_mult,
    -race_eth_hisp,
    # remove raw ed vars (custom proportion vars remain)
    -ed_doct,
    -ed_prof_deg,
    -ed_master,
    -ed_bach,
    -ed_assoc,
    -ed_reg_hsd,
    -ed_ged,
    -ed_comb_hsd,
    -ed_comb_adv_deg,
    # remove duplicative housing data (custom proportion vars and pct var
    # remain)
    -total_hous_units,
    -occ_units,
    -vac_units,
    # remove duplicative land data (custom proportion / "pct" vars remain)
    -ends_with("_ac"),
    -total_sq_mi,
    -water_sq_mi,
    -wtr_pct,
    -aland,
    -awater,
    -utcac0620,
    -utcac1120,
    -utcac1520,
    -utcpct0620,
    -utcpct1120,
    -utc_pct_06,
    -utc_pct_11,
    -utc_pct_15,
    # remove pop_density_water_sq_mi due to INF value for most records 
    -pop_density_water_sq_mi) 

# Creating predictions dataset
dt_predictions <- bind_cols(
  dc_test,
  predict(object = dt_fitting, new_data = dc_test),
  predict(object = dt_fitting, new_data = dc_test, type = "prob")
) 

# Editing column names so it's readable
names(dt_predictions) <- names(dt_predictions) |>
  str_replace(" ", "_")

# Printing a few predictions
dt_preds <- dt_predictions |>
  select(hotspot, .pred_class, .pred_hotspot, .pred_not_hotspot)

kable(dt_preds[1:20,])
hotspot .pred_class .pred_hotspot .pred_not_hotspot
not hotspot not hotspot 0.1162791 0.8837209
not hotspot not hotspot 0.1162791 0.8837209
not hotspot not hotspot 0.1162791 0.8837209
not hotspot not hotspot 0.1162791 0.8837209
not hotspot not hotspot 0.3750000 0.6250000
not hotspot not hotspot 0.1162791 0.8837209
not hotspot not hotspot 0.3750000 0.6250000
not hotspot not hotspot 0.1162791 0.8837209
not hotspot not hotspot 0.1162791 0.8837209
not hotspot not hotspot 0.1162791 0.8837209
not hotspot not hotspot 0.1162791 0.8837209
not hotspot not hotspot 0.1162791 0.8837209
not hotspot not hotspot 0.1162791 0.8837209
not hotspot not hotspot 0.2631579 0.7368421
not hotspot not hotspot 0.1162791 0.8837209
not hotspot not hotspot 0.1162791 0.8837209
not hotspot not hotspot 0.1162791 0.8837209
not hotspot not hotspot 0.1162791 0.8837209
not hotspot not hotspot 0.1162791 0.8837209
not hotspot not hotspot 0.1162791 0.8837209

Evaluate with Performance Metrics

Create the Confusion Matrix

# Transforming the hotspot variable to become a factor variable
dt_predictions$hotspot <- as.factor(dt_predictions$hotspot)

# Creating confusion matrix
conf_mat(data = dt_predictions,
         truth = hotspot,
         estimate = .pred_class)
             Truth
Prediction    hotspot not hotspot
  hotspot          16          13
  not hotspot       1          86

Calculate accuracy, precision, and recall/sensitivity:

kable(yardstick::accuracy(data = dt_predictions,
          truth = hotspot,
          estimate = .pred_class))
.metric .estimator .estimate
accuracy binary 0.8793103
kable(yardstick::precision(data = dt_predictions,
          truth = hotspot,
          estimate = .pred_class))
.metric .estimator .estimate
precision binary 0.5517241
kable(yardstick::recall(data = dt_predictions,
     truth = hotspot,
     estimate = .pred_class))
.metric .estimator .estimate
recall binary 0.9411765

Accuracy and recall are relatively high at 87.9% and 94.1%, respectively. This means that the model works relatively well in correctly classifying whether or not a block group is a hotspot and in specifically correctly identifying hotspots when that block group actually is a hotspost. Precision is not as high at 55.2%, since there were 13 instances in which the model predicted a block group is a hotspot when in reality it was not one. One possible way to improve the model is to adjust the threshold used to determine whether a block group is a hotspot or not. Another idea may be to examine the predictors used and adjust which ones should be included or not. For example, we made an assumption that using percentage area of the landcover variables rather than the total area variables (e.g., we chose percentage of area covered by canopy rather than total area covered by canopy) to allow consistency across block groups who vary in total area size. Perhaps it would be worth investigating the performance of this model using the total area variables instead or even alongside the percentage variables.

Linear Regression and Random Forest Models

Create a Recipe for the Regression and Random Forest Models

# Create recipe for use in linear regression and random forest models
dc_recipe <- recipe(formula = uhi ~ ., data = dc_train) |>
  # remove duplicative hotspot 
  step_rm(hotspot) |>
  # Removing predictors that have near zero variability
  step_nzv(all_predictors()) |>
  # Removing predictors that are highly correlated with others
  step_corr(all_predictors())

Linear Regression

Modelling

# Linear regression model
dc_lm_model <- 
  linear_reg() |>
  set_mode(mode = "regression") |>
  set_engine(engine = "lm")

# Linear regression workflow
dc_lm_workflow <- 
  workflow() |>
  add_recipe(recipe = dc_recipe) |>
  add_model(spec = dc_lm_model)

# Setting up resamples for later evaluation
dc_lm_resamples <- dc_lm_workflow |>
  fit_resamples(resamples = dc_folds)

Random Forests

Modelling

Hyperparameter Tuning

dc_rf_model <- 
  rand_forest(
  trees = 200,
  mtry = tune(),
  min_n = tune()) |>
  set_mode(mode = "regression") |>
  set_engine(
    engine = "ranger", 
    importance = "impurity",
    num.threads = 4
  )

# Create a parameter grid
dc_rf_grid <- grid_regular(mtry(range = c(1, 15)),
                           min_n(range = c(1, 15)),
                           levels = 5)

# Random forest workflow
dc_rf_workflow <- 
  workflow() |>
  add_recipe(dc_recipe) |>
  add_model(dc_rf_model)

# Creating a grid to show results of hyper parameter tuning
dc_rf_grid <- grid_regular(
  mtry(range = c(1, 15)),
  min_n(range = c(1, 15)),
  levels = 5
)

# Tuning the grid
rf_resamples <- tune_grid(
  dc_rf_workflow,
  resamples = dc_folds,
  grid = dc_rf_grid
)

# Showing the best hyperparameters based on model performance
kable(show_best(rf_resamples))
mtry min_n .metric .estimator mean n std_err .config
15 8 rmse standard 0.8899883 10 0.0338497 Preprocessor1_Model15
15 1 rmse standard 0.8913224 10 0.0317014 Preprocessor1_Model05
8 4 rmse standard 0.8925670 10 0.0324308 Preprocessor1_Model08
11 1 rmse standard 0.8927559 10 0.0329127 Preprocessor1_Model04
8 1 rmse standard 0.8928784 10 0.0314905 Preprocessor1_Model03

Based on hyperparameter tuning and the most recent rendering of this document, the best random forest model uses mtry = 15 and min_n = 8. This means it has selected 15 predictors to be randomly sampled when splitting before each tree model and 8 minimum data points needed to do further splits.

Implementing best hyperparameters based on tuning

# Filling in hyperparameters based on tuning
dc_rf_model <- rand_forest(
  trees = 200,
  mtry = 15,
  min_n = 8
) |>
  set_mode(mode = "regression") |>
  set_engine(
    engine = "ranger", 
    importance = "impurity",
    num.threads = 4
  )

# Random forest workflow
dc_rf_workflow <- 
  workflow() |>
  add_recipe(dc_recipe) |>
  add_model(dc_rf_model)

rf_resamples <- dc_rf_workflow |>
  fit_resamples(resamples = dc_folds)

Compare the Linear Regression and Random Forest Models

Plot the RMSEs across each fit:

# Linear Regression: Calculate and plot the RMSE for each resample
kable(collect_metrics(
  dc_lm_resamples, 
  summarize = FALSE) |>
  filter(.metric == "rmse"))
id .metric .estimator .estimate .config
Fold01 rmse standard 0.0004106 Preprocessor1_Model1
Fold02 rmse standard 0.0005899 Preprocessor1_Model1
Fold03 rmse standard 0.2155494 Preprocessor1_Model1
Fold04 rmse standard 0.0006500 Preprocessor1_Model1
Fold05 rmse standard 0.0005780 Preprocessor1_Model1
Fold06 rmse standard 0.0006337 Preprocessor1_Model1
Fold07 rmse standard 0.0006282 Preprocessor1_Model1
Fold08 rmse standard 0.1740280 Preprocessor1_Model1
Fold09 rmse standard 0.0006290 Preprocessor1_Model1
Fold10 rmse standard 0.0007083 Preprocessor1_Model1
collect_metrics(
  dc_lm_resamples, 
  summarize = FALSE) |>
  filter(.metric == "rmse") |>
  ggplot(mapping = aes(x = id, y = .estimate, group = .metric)) +
  geom_point() +
  geom_line() +
  theme_void() +
  labs(
    title = "RMSE of Linear Regression Model Fits Across Folds") +
  xlab("Fold") +
  ylab("RMSE")

# Random Forests: Calculate and plot the RMSE for each resample
kable(rf_resamples |>
  collect_metrics(summarize = FALSE) |>
  filter(.metric == "rmse"))
id .metric .estimator .estimate .config
Fold01 rmse standard 0.7839339 Preprocessor1_Model1
Fold02 rmse standard 0.9487524 Preprocessor1_Model1
Fold03 rmse standard 0.8090248 Preprocessor1_Model1
Fold04 rmse standard 0.8965515 Preprocessor1_Model1
Fold05 rmse standard 0.8107504 Preprocessor1_Model1
Fold06 rmse standard 0.7601461 Preprocessor1_Model1
Fold07 rmse standard 1.0226176 Preprocessor1_Model1
Fold08 rmse standard 0.8536072 Preprocessor1_Model1
Fold09 rmse standard 1.0213467 Preprocessor1_Model1
Fold10 rmse standard 1.0053431 Preprocessor1_Model1
rf_resamples |>
  collect_metrics(summarize = FALSE) |>
  filter(.metric == "rmse") |>
  ggplot(mapping = aes(x = id, y = .estimate, group = .metric)) +
  geom_point() +
  geom_line() +
  theme_void() +
  labs(
    title = "RMSE of Random Forest Model Fits Across Folds") +
  xlab("Fold") +
  ylab("RMSE")

# Compare average rmse
kable(bind_rows(
  "Linear regression" = collect_metrics(dc_lm_resamples) |>
    filter(.metric == "rmse"), 
  'Random forest' = collect_metrics(rf_resamples) |>
    filter(.metric == "rmse"),
  .id = "model"))
model .metric .estimator mean n std_err .config
Linear regression rmse standard 0.0394405 10 0.0260757 Preprocessor1_Model1
Random forest rmse standard 0.8912074 10 0.0323003 Preprocessor1_Model1

The linear regression model outperforms the random forests model significantly, based on a comparison of the average root mean squared error of each model.

Implement the Final Model

# Select the best resamples via select_best
lm_best <- dc_lm_resamples |>
  select_best(metric = "rmse")

# Finalize the workflow with finalize_workflow
lm_final <- finalize_workflow(
  dc_lm_workflow,
  parameters = lm_best
)

# Fit the model on all the data
lm_final_fit <- 
  lm_final |>
  last_fit(dc_split) 

# Collect metrics for RMSE
kable(lm_final_fit |>
  collect_metrics() |>
  filter(.metric == "rmse"))
.metric .estimator .estimate .config
rmse standard 0.0006264 Preprocessor1_Model1
# Make predictions with the testing data
lm_predictions <- bind_cols(
  dc_test,
  predict(object = extract_workflow(lm_final_fit), 
          new_data = dc_test) 
)

# Display the first 10 predictions against the actual values
kable(select(lm_predictions, uhi, starts_with(".pred")))
uhi .pred
88.46352 88.46327
84.07872 84.07895
89.18240 89.18221
90.83478 90.83565
91.49319 91.49289
87.98800 87.98787
90.91855 90.91851
91.22996 91.22987
86.75634 86.75600
91.28605 91.28467
88.21297 88.21425
89.02017 89.02014
90.88171 90.88170
91.24494 91.24481
91.11902 91.11910
86.09227 86.09214
90.41321 90.41312
88.43620 88.43472
88.88523 88.88529
87.96607 87.96593
91.79575 91.79579
87.62974 87.62993
87.31011 87.31014
89.10649 89.10658
85.47182 85.47092
88.43828 88.43847
91.14506 91.14470
87.59863 87.59818
91.16683 91.16666
91.55821 91.55804
91.65225 91.65212
90.45383 90.45434
89.13717 89.13717
92.78224 92.78133
91.34223 91.34092
90.43526 90.43529
90.55059 90.55063
88.10257 88.10274
88.57215 88.57233
87.30355 87.30255
92.88210 92.88055
93.01053 93.01037
91.20824 91.20828
92.33900 92.33915
91.44217 91.44088
91.31380 91.31251
90.08510 90.08529
90.31454 90.31472
90.34040 90.34050
92.71677 92.71653
91.12966 91.13089
91.70199 91.70101
91.27360 91.27245
91.12722 91.12699
89.90627 89.90664
91.11053 91.10946
91.91128 91.91105
90.12384 90.12395
89.17259 89.17246
92.50475 92.50473
90.61079 90.61050
90.95426 90.95404
90.72494 90.72457
90.45071 90.45214
92.04979 92.04977
93.05506 93.05488
92.57818 92.57796
90.46372 90.46361
90.89459 90.89450
90.60280 90.60261
90.94285 90.94233
92.94851 92.94808
90.06922 90.06777
92.66947 92.66875
92.17783 92.17794
91.43008 91.43021
90.58809 90.58815
89.93667 89.93793
91.61064 91.61079
89.73022 89.72999
90.49735 90.49598
90.67810 90.67779
89.64401 89.64401
90.98889 90.98766
90.37919 90.38026
90.13498 90.13510
90.02812 90.02827
87.61195 87.61294
90.43686 90.43564
89.14399 89.14393
90.03627 90.03631
90.26955 90.26962
89.64193 89.64204
89.54497 89.54488
90.85520 90.85521
90.12619 90.12615
89.77884 89.77882
90.04584 90.04449
89.17864 89.17837
88.87318 88.87332
88.78104 88.78113
89.06382 89.06405
89.34039 89.34039
87.23057 87.23009
87.73870 87.73839
86.04946 86.04936
86.38000 86.38147
87.32229 87.32257
88.43309 88.43256
86.42017 86.41998
86.83332 86.83465
87.54171 87.54164
88.54941 88.54935
87.90347 87.90457
88.84667 88.84678
88.56152 88.56154
# Interrogate which values were incorrectly predicted to the nearest tenth
kable(lm_predictions |>
  select(uhi, .pred) |>
  mutate(.pred = round(lm_predictions$.pred, 1),
         uhi = round(lm_predictions$uhi, 1),
         correct = if_else(.pred == uhi, "Yes", "No"
         )) |>
  count(correct))
correct n
Yes 116

We were suprised to observe that the linear regression model predicted all UHI values, rounded to the nearest tenth. Before such a model can be implemented, further investigation of this extremely close estimation is needed. However, as we required the model to predict to larger decimals, its performance declined. It predicted 109 values (~94%) to the nearest hundredth correctly, and 73 values (~63%) to the nearest thousandth correctly.

Visualize the Most Important Predictors

lm_final_fit |>
  extract_fit_parsnip() |>
  vip(num_features = 20) +
  labs(title = "Top Predictors in Linear Regression Model") +
  ylab("Importance") +
  xlab("Variable")

Interestingly, the model did not identify any Census American Community Survey variables as key predictors of UHI in the District of Columbia. From this result, we concluded that demographic variables are only related to UHI through their relationships to the landcover variables. Of the 54 predictors fed into each model, the linear regression model identified seven key landcover variables as the most important predictors of UHI:
* OVERALL: The overall suitability for tree planting score of a given block group.
* WLDLF_CON: Available planting areas within 100’ of large canopy tracts (equal to or greater than 5 acres).
* ENRGY_CON: Residentially zoned areas with less than the district average tree cover and higher than average total possible planting area.
* LAND_OWN: Possible planting area on District owned operated or managed public & public/private land. Values are the percentage of plantable space within public land.
* POS_UTC: Positive urban tree canopy.
* STRM_WTR: Uses available planting area on, and adjacent to, impervious surfaces and surface water bodies to indicate areas where trees will more directly benefit stormwater runoff mitigation.
* TO_UN_PCT: The percentage of land area in the assessment boundary that is considered unsuitable for planting. This includes buildings and roads, barren soil and dry vegetation, as well as sports fields, golf course playing areas, airports, agricultural land, wetlands and any other land identified as unsuitable.

In practice, before implementing the model, we would need to explore the definitions of all landcover variables as well as their data collection context more closely with the District. Little information is available publicly on each variable beyond a set of basic definitions. Further, many of these top variables seem to be plausibly related to each other, even though we included step_corr() in our recipe. This warrants further investigation of how step_corr() removes correlated variables and the relationships among the remaining variables.

Practical Use of the Model

We envision that the DC government could leverage the decision tree and/or linear regression models over the years as demographics and land use changes over time. For example, if they have updated data for the predictors for some block groups, the government could run these models to see which block groups may be facing disproportionately high temperatures when Summer comes around. Thus, the government could priortize projects in these areas aimed at improving temperature cooling during hot months. The model could also be used by local nonprofit groups that also provide services to local communities, though these nonprofits should do so in partnership with DC government, who has more contextual information about the landcover variables.

If the DC government is interested in using or adapting this model, we do have some recommendations for future improvement. One would to use more recent temperature data before running the model on current/updated landcover variables; we would hope that the government may be able to access more viable remote sensing data. They should then also use more update Census demographic data as well. If the government may look to improve the models built, we would also recommend being more select with the landcover variables. The public metadata does not offer a lot of information about each variable, whereas the DC government likely has greater context about which variables may be duplicative and should be excluded.

Sources

Arizona State Climate Office. “Urban Heat Island.” Accessed December 17, 2024. https://globalfutures.asu.edu/azclimate/urban-heat-island/.

“SUPR Lab Research.” Accessed December 17, 2024. https://climatecope.research.pdx.edu/.

“Urban Heat Island/UHI Index 2018 (Portland State University) | Tacoma Open Data.” Accessed December 17, 2024. https://data.cityoftacoma.org/datasets/tacoma::urban-heat-island-uhi-index-2018-portland-state-university/about.

“Urban Tree Canopy by Census Block Group in 2020.” Vector digital data, April 5, 2022. https://opendata.dc.gov/datasets/DCGIS::urban-tree-canopy-by-census-block-group-in-2020/about.

US Census Bureau. “American Community Survey: 5-Year Estimates: Detailed Tables 5-Year.” Accessed December 17, 2024. https://api.census.gov/data/2020/acs/acs5.html.

US Census Bureau. “American Community Survey: Multiyear Accuracy of the Data (5-year 2016-2020).” Accessed December 17, 2024. https://www2.census.gov/programs-surveys/acs/tech_docs/accuracy/MultiyearACSAccuracyofData2020.pdf

“Notes on ACS Estimate and Annotation Values.” Census.gov. Accessed December 17, 2024. https://www.census.gov/data/developers/data-sets/acs-1year/notes-on-acs-estimate-and-annotation-values.html.

US EPA, OAR. “What Are Heat Islands?” Overviews and Factsheets, June 17, 2014. https://www.epa.gov/heatislands/what-are-heat-islands.